Abstract

Directed acyclic graphs (DAGs) are a popular framework to express multivariate probability distributions. Acyclic
directed mixed graphs (ADMGs) are generalizations of DAGs that can succinctly capture much richer sets of conditional
independencies, and are especially useful in modeling the effects of latent variables implicitly. Unfortunately,
there are currently no good parameterizations of general ADMGs. In this paper, we apply recent work on cumulative
distribution networks and copulas to propose one general construction for ADMG models. We consider a simple
parameter estimation approach, and report some encouraging experimental results.

1 Contribution

Graphical models provide a powerful framework for encoding independence constraints in a multivariate distribution
[11, 14]. Two of the most common families, the directed acyclic graph (DAG) and the undirected network, have
complementary properties. For instance, DAGs are non-monotonic independence models, in the sense that conditioning
on extra variables can also destroy independencies (sometimes known as the "explaining away" phenomenon [11]).
Undirected networks allow for flexible "symmetric" parameterizations that do not require a particular ordering of the
variables.

More recently, alternative graphical models that allow for both directed and symmetric relationships have been
introduced. The acyclic directed mixed graph (ADMG) has both directed and bi-directed edges and it is the result of
marginalizing a DAG: Figure 1 provides an example. [21, 19] show that DAGs are not closed under marginalization,
but ADMGs are. Reading off independence constraints from an ADMG can be done with a procedure essentially
identical to d-separation [11, 21].

Theoretical properties and practical applications of ADMGs are further discussed in detail by e.g. [2], [25], [5], [29],
[18], [24], [10]. One can also have latent variable ADMG models, where bi-directed edges represent a subset of latent
variables that have been marginalized. In sparse models, using bi-directed edges in ADMGs frees us from having to
specify exactly which latent variables exist and how they might be connected. In the context of Bayesian inference,
Markov chain Monte Carlo in ADMGs might have much better mixing properties compared to models where all latent
variables are explicitly included [24].

However, it is hard in general to parameterize a likelihood function that obeys the independence constraints encoded
in an ADMG. Gaussian likelihood functions and their variations (e.g., mixture models and probit models) have been
the only families exploited in most of the literature [21, 24]. The contribution of this paper is to provide a flexible
construction procedure to design probability mass functions and density functions that are Markov with respect to
an arbitrary ADMG. This is done by exploiting recent work on cumulative distribution networks [9] and copulas
[16, 13]. We also provide a straightforward approach to learning in our ADMGs inspired by the parameter estimation
approaches in the copula literature. We review mixed graphs and cumulative distribution networks in Section 2. The
full formalism is given in detail in Section 3. An instantiation of the framework based on copulas and a parameter
estimation procedure is described in Section 4. Experiments are described in Section 5, and we conclude with Section 6.

Figure 1: (a) A DAG representing dependencies over a set of variables (adapted from [25], page 137) in a medical
domain. (b) The ADMG representing conditional independencies corresponding to (a), but only among the remaining
vertices: pollution and genotype factors were marginalized. In general, bi-directed edges emerge from unspecified
variables that have been marginalized but still have an effect on the remaining variables. The ADMG is acyclic in the
sense that there are no cycles composed of directed edges only. In general, a DAG cannot represent the remaining set
of independence constraints after some variables in another DAG have been marginalized.



2 Mixed Graphs and Cumulative Distribution Networks

In this section, we provide a summary of the relevant properties of mixed graph models and cumulative distribution
networks, and the relationship between the two formalisms.

A bi-directed graph is a special case of an ADMG without directed edges. The absence of an edge $(X_i, X_j)$ implies
that $X_i$ and $X_j$ are marginally independent. Hence, bi-directed models are models of marginal independence [5]. Just
like in a DAG, conditioning on a vertex that is the endpoint of two arrowheads will make some variables dependent.
For instance, for a bi-directed graph $X_1 \leftrightarrow X_2 \leftrightarrow X_3$, we have that $X_1 \perp\!\!\!\perp X_3$ but $X_1 \not\perp\!\!\!\perp X_3 \mid X_2$. See [4, 5] for a full
discussion.

Current parameterizations of bi-directed graphs suffer from a number of practical difficulties. For example, consider
binary bi-directed graphs, where a complete parameterization was introduced by Drton and Richardson [5]. Let
$\mathcal{G}$ be a bi-directed graph with vertex set $X_V$. Let $q_A \equiv P(X_A = 0)$, for any vertex set $X_A$ contained in $X_V$. The joint
probability $P(X_A = 0, X_{V \setminus A} = 1)$ is given by

$$P(X_A = 0, X_{V \setminus A} = 1) = \sum_{B : A \subseteq B} (-1)^{|B \setminus A|} q_B \tag{1}$$

The set $\{q_S : X_S \subseteq X_V\}$ is known as the Möbius parameterization of $P(X_V)$, since relationship (1) is an instance of the
Möbius inversion operation [14]. The marginal independence of the bi-directed graph implies $P(X_A = 0, X_B = 0) = P(X_A = 0)P(X_B = 0)$
if no element in $X_A$ is adjacent to any element in $X_B$ in $\mathcal{G}$. Therefore, the set of independent
parameters in this parameterization is given by $\{q_A\}$, for all $X_A$ that form a connected set in $\mathcal{G}$. This parameterization
is complete, in the sense that any binary model that is Markov with respect to $\mathcal{G}$ can be represented by the set $\{q_A\}$.
However, this comes at a price: in general, the number of connected sets can grow exponentially in $|X_V|$ even for a
sparse, tree-structured graph. Moreover, the set $\{q_A\}$ is not variation independent [14]: the parameter space is defined
by exponentially many constraints. In contrast, different conditional probability tables in a given Bayesian network
can be parameterized independently [14, 17].
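The growth in the number of connected sets can be checked by brute force on small graphs. The sketch below is illustrative only: the helper names (`is_connected`, `count_connected_sets`, `star`) are hypothetical, and the star graph is just one convenient tree-structured example. A star with $n$ leaves has $2^n + n$ connected sets (every subset of leaves together with the centre, plus each leaf on its own).

```python
from itertools import combinations


def is_connected(subset, edges):
    """Return True if `subset` induces a connected subgraph of the edge list."""
    subset = set(subset)
    start = next(iter(subset))
    seen, stack = {start}, [start]
    while stack:
        v = stack.pop()
        for a, b in edges:
            for u, w in ((a, b), (b, a)):
                if u == v and w in subset and w not in seen:
                    seen.add(w)
                    stack.append(w)
    return seen == subset


def count_connected_sets(vertices, edges):
    """Count non-empty vertex subsets that are connected (one q_A each)."""
    return sum(
        1
        for r in range(1, len(vertices) + 1)
        for subset in combinations(vertices, r)
        if is_connected(subset, edges)
    )


def star(n_leaves):
    """Bi-directed star: vertex 0 is the centre, 1..n_leaves are leaves."""
    vertices = list(range(n_leaves + 1))
    edges = [(0, leaf) for leaf in range(1, n_leaves + 1)]
    return vertices, edges


for n in (2, 4, 8):
    print(n, count_connected_sets(*star(n)))
```

Even though the star has only $n$ edges, the count doubles with every added leaf, which is the exponential behaviour described above.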

Cumulative distribution networks (CDNs), introduced by Huang and Frey [9] as a convenient family of cumulative
distribution functions (CDFs), provide an alternative construction of bi-directed models by indirectly introducing
additional constraints to reduce the total number of parameters. Let $X_V$ be a set of random variables, and let $\mathcal{G}$ be a
bi-directed graph with $\mathcal{C}$ being a set of cliques in $\mathcal{G}$. The CDF over $X_V$ is given by

$$P(X_V \le x_V) = F(x_V) = \prod_{S \in \mathcal{C}} F_S(x_S) \tag{2}$$

where each $F_S$ is a parameterized CDF over $X_S$. A sufficient condition for (2) to define a valid CDF is that each
$F_S$ is itself a CDF. CDNs satisfy the conditional independence constraints of bi-directed graphs [9]. For example,
consider $X_1 \leftrightarrow X_2 \leftrightarrow X_3$, with cliques $X_{S_1} = \{X_1, X_2\}$ and $X_{S_2} = \{X_2, X_3\}$. The marginal CDF of $X_1$ and $X_3$ is
$P(X_1 \le x_1, X_3 \le x_3) = P(X_1 \le x_1, X_2 \le \infty, X_3 \le x_3) = F_1(x_1, \infty) F_2(\infty, x_3)$. Since this factorizes, it follows
that $X_1$ and $X_3$ are marginally independent.

The relationship between the complete parameterization of Drton and Richardson and the CDN parameterization
can be exemplified in the discrete case. Let each $X_i$ take values in $\{0, 1, 2, \ldots\}$. Recall that the relationship between a
CDF and a probability mass function is given by the following inclusion-exclusion formula [12]:

$$P(x_1, \ldots, x_d) = \sum_{z_1 = 0}^{1} \sum_{z_2 = 0}^{1} \cdots \sum_{z_d = 0}^{1} (-1)^{z_1 + z_2 + \cdots + z_d} F(x_1 - z_1, x_2 - z_2, \ldots, x_d - z_d) \tag{3}$$

for $d = |X_V|$. In the binary case, since $q_A = P(X_A = 0) = P(X_A \le 0, X_{V \setminus A} \le 1) = F(x_A = 0, x_{V \setminus A} = 1)$,
one can check that (1) and (3) are the same expression. The difference between the CDN parameterization [9] and the
complete parameterization [5] is that, on top of enforcing $q_{A \cup B} = q_A q_B$ for $X_A$ disconnected from $X_B$, we have the
additional constraints

$$q_A = \prod_{A_C \in \mathcal{C}(A)} q_{A_C} \tag{4}$$

for each connected set $X_A$, where $\mathcal{C}(A)$ are the maximal cliques in the subgraph obtained by keeping only the vertices
$X_A$ and the corresponding edges from $\mathcal{G}$.
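As a concrete check of the CDN construction, the following sketch builds a binary CDN for the chain $X_1 \leftrightarrow X_2 \leftrightarrow X_3$ as a product of two bivariate CDF tables, converts the product CDF into a probability mass function with the inclusion-exclusion formula, and verifies that $X_1$ and $X_3$ are marginally independent. The particular clique tables and the function names (`cdf_from_pmf`, `pmf_from_cdf`) are arbitrary choices made for illustration.

```python
import itertools

import numpy as np


def cdf_from_pmf(pmf):
    """Cumulative sums along every axis turn a PMF table into a CDF table."""
    cdf = pmf.copy()
    for axis in range(pmf.ndim):
        cdf = np.cumsum(cdf, axis=axis)
    return cdf


def pmf_from_cdf(cdf):
    """Inclusion-exclusion: p(x) = sum_z (-1)^{sum z} F(x - z), F = 0 below 0."""
    d = cdf.ndim
    pmf = np.zeros_like(cdf)
    for x in itertools.product(*(range(n) for n in cdf.shape)):
        total = 0.0
        for z in itertools.product((0, 1), repeat=d):
            idx = tuple(xi - zi for xi, zi in zip(x, z))
            if min(idx) < 0:  # the CDF is zero outside the support
                continue
            total += (-1) ** sum(z) * cdf[idx]
        pmf[x] = total
    return pmf


# Arbitrary clique factors: each is the CDF of a bivariate binary distribution.
F12 = cdf_from_pmf(np.array([[0.4, 0.1], [0.1, 0.4]]))
F23 = cdf_from_pmf(np.array([[0.3, 0.2], [0.1, 0.4]]))

# CDN: F(x1, x2, x3) = F12(x1, x2) * F23(x2, x3).
F = F12[:, :, None] * F23[None, :, :]
p = pmf_from_cdf(F)

p13 = p.sum(axis=1)                      # marginal of (X1, X3)
p1, p3 = p13.sum(axis=1), p13.sum(axis=0)
print(np.allclose(p13, np.outer(p1, p3)))  # marginal independence of X1, X3
```

The enumeration over $z$ costs $2^d$ terms per cell, which is fine for three variables but is exactly the cost that dynamic programming avoids in larger low tree-width graphs.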

As a framework for the construction of bi-directed models, CDNs have four major desirable features. Firstly, the
number of parameters grows with the size of the largest clique, instead of $|X_V|$. Secondly, parameters in different
cliques are variation independent, since (2) is well-defined if each individual factor is a CDF. Thirdly, this is a general
framework that allows not only for binary variables, but continuous, ordinal and unbounded discrete variables as
well. Finally, in graphs with low tree-widths, probability densities/masses can be computed efficiently by dynamic
programming [9]. To summarize, CDNs provide a restricted family of marginal independence models, but one that
has computational, statistical and modeling advantages. Depending on the application, the extra constraints are not
harmful in practice, as demonstrated by [10].

3 Mixed Cumulative Distribution Models 

In what follows, we will extend the CDN family to general acyclic directed mixed graphs: the mixed cumulative
distribution network (MCDN) model. In Section 3.1 we describe a higher-level factorization of the probability (mass
or density) function $P(X_V)$ involving subgraphs of $\mathcal{G}$. In Section 3.2 we describe cumulative distribution functions
that can be used to parameterize each factor defined in Section 3.1 in the special case where no directed edges exist
between members of the same subgraph. Finally, in Section 3.3 we describe the general case.

Some important notation and definitions: there are two kinds of edges in an ADMG; either $X_k \rightarrow X_j$ or $X_k \leftrightarrow X_j$.
In the former case (but not the latter) we call $X_k$ a parent of $X_j$. We use $\mathrm{pa}_{\mathcal{G}}(X_A)$ to represent the parents of a set
of vertices $X_A$ in graph $\mathcal{G}$. For a given $\mathcal{G}$, $(\mathcal{G})_A$ represents the subgraph obtained by removing from $\mathcal{G}$ any vertex not
in set $A$ and the respective edges; $(\mathcal{G})_{\leftrightarrow}$ is the subgraph obtained by removing all directed edges. We say that a set of
nodes $A$ in $\mathcal{G}$ is an ancestral set if it is closed under the ancestral relationship: if $X_v \in A$, then all ancestors of $X_v$ in $\mathcal{G}$
are also in $A$. Finally, define the districts of a graph $\mathcal{G}$ as the connected components of $(\mathcal{G})_{\leftrightarrow}$. Hence each district is a
set of vertices, $X_D$, such that if $X_i$ and $X_j$ are in $X_D$ then there is a path connecting $X_i$ and $X_j$ composed entirely of
bi-directed edges. Note that trivial districts are permitted, where $X_D = \{X_i\}$. Associated with each district $X_{D_i}$ is a
subgraph $\mathcal{G}_i$ consisting of nodes $X_{D_i} \cup \mathrm{pa}_{\mathcal{G}}(X_{D_i})$. The edges of $\mathcal{G}_i$ are all of the edges of $(\mathcal{G})_{X_{D_i} \cup \mathrm{pa}_{\mathcal{G}}(X_{D_i})}$ excluding
all edges among $\mathrm{pa}_{\mathcal{G}}(X_{D_i}) \setminus X_{D_i}$. Two examples are shown in Figure 2.
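The district definition translates directly into a connected-components computation on the bi-directed edges. The sketch below is a minimal illustration; the function names are hypothetical, and the directed edges $X_4 \rightarrow X_2$ and $X_1 \rightarrow X_3$ are an assumption chosen to be consistent with the description of a graph with districts $\{X_1, X_2\}$ (parent $X_4$) and $\{X_3, X_4\}$ (parent $X_1$), since the figure itself does not say which district member each parent points to.

```python
def districts(vertices, bidirected):
    """Connected components of the graph restricted to bi-directed edges."""
    adjacency = {v: set() for v in vertices}
    for a, b in bidirected:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, result = set(), []
    for v in vertices:
        if v in seen:
            continue
        component, stack = {v}, [v]
        while stack:
            u = stack.pop()
            for w in adjacency[u] - component:
                component.add(w)
                stack.append(w)
        seen |= component
        result.append(frozenset(component))
    return result


def external_parents(district, directed):
    """Parents of the district's vertices that are not themselves in it."""
    return {p for p, c in directed if c in district and p not in district}


# Assumed edge set (hypothetical): X1 <-> X2, X3 <-> X4, X4 -> X2, X1 -> X3.
vertices = ["X1", "X2", "X3", "X4"]
bidirected = [("X1", "X2"), ("X3", "X4")]
directed = [("X4", "X2"), ("X1", "X3")]

for d in districts(vertices, bidirected):
    print(sorted(d), "external parents:", sorted(external_parents(d, directed)))
```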

1 [9] describe the model in terms of factor graphs, but for our purposes a bi-directed representation is more appropriate.
2 This property was called min-independence in [9].
Figure 2: (a) The ADMG has two districts, $X_{D_1} = \{X_1, X_2\}$ with singleton parent $X_4$, and $X_{D_2} = \{X_3, X_4\}$ with
parent $X_1$. (b) A more complicated example with two districts. Notice that the district given by $X_{D_1} = \{X_1, X_2, X_3\}$
has as external parent $X_4$, but internally some members of the district might be parents of other members. The other
district is a singleton, $X_{D_2} = \{X_4\}$. (c) The two corresponding subgraphs $\mathcal{G}_1$ and $\mathcal{G}_2$ are shown here.

3.1 District factorization 

Given any ADMG $\mathcal{G}$ with vertex set $X_V$, we parameterize its probability mass/density function as:

$$P(X_V) = \prod_{i=1}^{K} P_i(X_{D_i} \mid \mathrm{pa}_{\mathcal{G}}(X_{D_i}) \setminus X_{D_i}) \tag{5}$$

where $\{X_{D_1}, X_{D_2}, \ldots, X_{D_K}\}$ is the set of districts of $\mathcal{G}$. That is, each factor is a probability (mass/density) function
for $X_{D_i}$ given its set of parents in $\mathcal{G}$ (that are not already in $X_{D_i}$). We require that

• Each $P_i(X_{D_i} \mid \mathrm{pa}_{\mathcal{G}}(X_{D_i}) \setminus X_{D_i})$ is Markov with respect to $\mathcal{G}_i$,

where a probability function $P(\cdot)$ is Markov with respect to an ADMG $\mathcal{G}$ if any conditional independence constraint
encoded in $\mathcal{G}$ is exhibited in $P(\cdot)$.

The relevance of this factorization is summarized by the following result. 

Proposition 1. A probability function $P(X_V)$ is Markov with respect to $\mathcal{G}$ if it can be factorized according to (5) and
each $P_i(X_{D_i} \mid \mathrm{pa}_{\mathcal{G}}(X_{D_i}) \setminus X_{D_i})$ is Markov with respect to the respective $\mathcal{G}_i$.

Proofs of all results are in Appendix A. 

Note that (5) is seemingly cyclical: for instance, Figure 2(a) implies the factorization $P_1(X_1, X_2 \mid X_4) P_2(X_3, X_4 \mid X_1)$.
This suggests that there are additional constraints tying parameters across different factors. However, there are no such
constraints, as guaranteed through the following result:

Proposition 2. Given an ADMG $\mathcal{G}$ with respective subgraphs $\{\mathcal{G}_i\}$ and districts $\{X_{D_i}\}$, any collection of probability
functions $P_i(X_{D_i} \mid \mathrm{pa}_{\mathcal{G}}(X_{D_i}) \setminus X_{D_i})$, Markov with respect to the respective $\mathcal{G}_i$, implies that (5) is a valid probability
function (a non-negative function that integrates to 1).

The implication is that one can independently parameterize each individual $P_i(\cdot \mid \cdot)$ to obtain a valid $P(X_V)$
Markov with respect to any given ADMG $\mathcal{G}$. In the next sections, we show how to parameterize each $P_i(\cdot \mid \cdot)$ by
factorizing its corresponding cumulative distribution function.
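The claim that independently chosen district factors multiply to a valid distribution can be checked numerically on a four-variable binary example with factors $P_1(X_1, X_2 \mid X_4)$ and $P_2(X_3, X_4 \mid X_1)$. The sketch below is an illustration under assumptions: the directed edges are taken to be $X_4 \rightarrow X_2$ and $X_1 \rightarrow X_3$ (hypothetical, chosen to match the stated districts and parents), and each factor is made Markov with respect to its subgraph by construction, so that $X_1$ does not depend on $X_4$ within $P_1$ and $X_4$ does not depend on $X_1$ within $P_2$.

```python
import numpy as np

rng = np.random.default_rng(0)


def random_conditional(shape, rng):
    """Random non-negative table normalised over its first axis."""
    table = rng.random(shape)
    return table / table.sum(axis=0, keepdims=True)


# Factor 1: P1(x1, x2 | x4) = P(x1) P(x2 | x1, x4), so X1 is independent of X4.
p_x1 = random_conditional((2,), rng)
p_x2_given = random_conditional((2, 2, 2), rng)   # axes: (x2, x1, x4)
P1 = np.einsum("a,bad->abd", p_x1, p_x2_given)    # axes: (x1, x2, x4)

# Factor 2: P2(x3, x4 | x1) = P(x4) P(x3 | x4, x1), so X4 is independent of X1.
p_x4 = random_conditional((2,), rng)
p_x3_given = random_conditional((2, 2, 2), rng)   # axes: (x3, x4, x1)
P2 = np.einsum("d,cda->cda", p_x4, p_x3_given)    # axes: (x3, x4, x1)

# Product of the two district factors, axes (x1, x2, x3, x4).
joint = np.einsum("abd,cda->abcd", P1, P2)
print(joint.sum())  # sums to 1 up to floating point error
```

The two factors share no parameters, yet the product is normalised; the Markov restrictions inside each factor are what remove the apparent cyclicity.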



3.2 Models with barren districts 



Consider first the case where district $X_{D_i}$ is barren, that is, no $X_v \in X_{D_i}$ has a parent also in $X_{D_i}$ [20]. For a given
$\mathcal{G}_i$ with respective district $X_{D_i}$, consider the following function:

$$F_i(x_{D_i} \mid \mathrm{pa}_{\mathcal{G}}(X_{D_i})) = \left[ \prod_{X_S \in \mathcal{C}_i} F_S(x_S \mid \mathrm{pa}_{\mathcal{G}}(X_{D_i})) \right] \left[ \prod_{X_v \in X_{D_i}} F_v(x_v \mid \mathrm{pa}_{\mathcal{G}}(X_v)) \right] \tag{6}$$

where $\mathcal{C}_i$ is the set of cliques in $(\mathcal{G}_i)_{\leftrightarrow}$. Each term on the right-hand side is a conditional cumulative distribution
function: for sets of random variables $Y$ and $Z$, $F(y \mid z) = P(Y \le y \mid Z = z)$.
Figure 3: (a) A mixed graph with a single district that includes all five vertices. (b) The modified graph after including
artificial vertices (artificial vertices for childless variables are ignored). (c) A display of the four districts of the modified
graph in individual boxes. All districts are now barren, i.e., no directed edges can be found within a district.

Proposition 3. $F_i(x_{D_i})$ is a CDF for any choice of $\{\{F_S(x_S)\}, \{F_v(x_v \mid \mathrm{pa}_{\mathcal{G}}(X_v))\}\}$. If according to each $F_S(x_S)$,
$X_s \in X_S$ is marginally independent of any element in $\mathrm{pa}_{\mathcal{G}}(X_{D_i}) \setminus \mathrm{pa}_{\mathcal{G}}(X_s)$, the corresponding conditional probability
function $F_i(x_{D_i} \mid \mathrm{pa}_{\mathcal{G}}(X_{D_i}))$ is Markov with respect to $\mathcal{G}_i$.

Notice that the structure of type IV chain graphs [3] is a special case of ADMGs with barren districts. The
parameterization of [3] is complete for such graph models, but requires exponentially many parameters even in sparse
models.

To obtain the probability function (5), we calculate each $P_i(X_{D_i} \mid \mathrm{pa}_{\mathcal{G}}(X_{D_i}) \setminus X_{D_i})$ by differentiating the
corresponding (6) with respect to $X_{D_i}$. Although this operation, in the discrete case, is in the worst case exponential in
$|X_{D_i}|$, it can be performed efficiently for graphs where $(\mathcal{G})_{\leftrightarrow}$ has low tree-width [9].

3.3 The general case: reduction to barren case 

We reduce graphs with general districts to graphs with only barren districts by introducing artificial vertices. Create a
graph $\mathcal{G}^\star$ with the same vertex set as $\mathcal{G}$ and the same bi-directed edges. For each vertex $X_v$ in $\mathcal{G}$, perform the following
operation:

• add an artificial vertex $X_v^\star$ to $\mathcal{G}^\star$;

• add the edge $X_v \rightarrow X_v^\star$ to $\mathcal{G}^\star$, and make the children of $X_v^\star$ to be the original children of $X_v$ in $\mathcal{G}$;

• define the model $P(X_V, X_V^\star)$ to have the same factors (5) as $P(X_V)$, but substituting every occurrence of $X_v$
in $\mathrm{pa}_{\mathcal{G}}(X_{D_i})$ by the corresponding $X_v^\star$ in $\mathrm{pa}_{\mathcal{G}^\star}(X_{D_i})$. Moreover, define $P_v^\star(X_v^\star \mid X_v)$ such that

$$P_v^\star(X_v^\star = x \mid X_v = x) = 1 \tag{7}$$

The joint model is then

$$P(X_V, X_V^\star) = \prod_{i=1}^{K} P_i(X_{D_i} \mid \mathrm{pa}_{\mathcal{G}^\star}(X_{D_i}) \setminus X_{D_i}) \prod_{X_v \in X_V} P_v^\star(X_v^\star \mid X_v) \tag{8}$$

Since the last group of factors is identically equal to 1, they can be dropped from the expression.

From (8), it follows that $P(X_V = x_V, X_V^\star = x_V) = P(X_V = x_V)$. Since no two vertices in the same district can
now have a parent-child relation, all districts in $\mathcal{G}^\star$ are barren and as such we can parameterize $P(X_V = x_V, X_V^\star = x_V)$
according to the results of the previous section. A similar trick was exploited by [24] to reduce a problem of
modeling ADMG probit models to Gaussian models.
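The graph transformation can be sketched in a few lines. The example below is an illustration on a small hypothetical ADMG (not the one in the paper's figure): three vertices in a single district, with the directed edge $X_1 \rightarrow X_2$ inside that district. The function names and the `"*"` suffix used to label artificial vertices are assumptions made for the sketch.

```python
def districts(vertices, bidirected):
    """Connected components of the graph restricted to bi-directed edges."""
    adjacency = {v: set() for v in vertices}
    for a, b in bidirected:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, result = set(), []
    for v in vertices:
        if v in seen:
            continue
        component, stack = {v}, [v]
        while stack:
            u = stack.pop()
            for w in adjacency[u] - component:
                component.add(w)
                stack.append(w)
        seen |= component
        result.append(frozenset(component))
    return result


def all_districts_barren(vertices, directed, bidirected):
    """True if no directed edge joins two vertices of the same district."""
    district_of = {v: d for d in districts(vertices, bidirected) for v in d}
    return all(district_of[p] != district_of[c] for p, c in directed)


def add_artificial_vertices(vertices, directed, bidirected):
    """Add X_v* for every X_v, the edge X_v -> X_v*, and re-route the
    children of X_v so that they become children of X_v*."""
    star = {v: v + "*" for v in vertices}
    new_vertices = list(vertices) + [star[v] for v in vertices]
    new_directed = [(v, star[v]) for v in vertices]
    new_directed += [(star[p], c) for p, c in directed]
    return new_vertices, new_directed, list(bidirected)


# Hypothetical example: one district {X1, X2, X3} containing the edge X1 -> X2.
V = ["X1", "X2", "X3"]
D = [("X1", "X2")]
B = [("X1", "X2"), ("X2", "X3")]

V_star, D_star, B_star = add_artificial_vertices(V, D, B)
print(all_districts_barren(V, D, B))                 # False: X1 -> X2 is internal
print(all_districts_barren(V_star, D_star, B_star))  # True after the reduction
```

Each artificial vertex has no bi-directed edges, so it forms a singleton district, and every directed edge now either leaves an original district or enters it from a singleton.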

Figure 3 provides an example, adapted from [20]. The graph has a single district containing all vertices. The
corresponding transformed graph generates several singleton districts composed of one artificial variable each. In
Figure 3(c), we rearrange such districts to illustrate the decomposition described in Section 3.1.
4 Copula MCDNs and Parameter Estimation

The main result of Section 3 is that we can parameterize a MCDN model by parameterizing the factors in Equation (6)
corresponding to each district, which are then tied together by the joint model (8). However, we have not yet specified
how to construct each $F_S$ and $F_v$. In this section, we describe a particularly convenient way of parameterizing such
factors. We introduce copula MCDN models, a particular instantiation of the MCDN family, and show how to estimate
their parameters.

Copulas are a flexible approach to defining dependence among a set of random variables. This is done by specifying
the dependence structure and the marginal distributions separately [16] (see also [13] for a machine learning
perspective). Simply put, a copula function $C(u_1, \ldots, u_t)$ is just the CDF of a set of dependent random variables, each with
the uniform marginal distribution over $[0, 1]$. To define a joint distribution over a set of variables $\{X_v\}$ with arbitrary
marginal CDFs $F_v(x_v)$, we simply transform each $X_v$ into a uniform variable $u_v$ over $[0, 1]$ using $u_v = F_v(x_v)$. The
resulting joint CDF $F(x_1, \ldots, x_t) = C(F_1(x_1), \ldots, F_t(x_t))$ incorporates both the dependence encoded in $C$ and the
marginal distributions $F_v$.
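A bivariate example may make the construction concrete. The sketch below uses the Frank copula, $C(u, v) = -\theta^{-1} \log\!\left(1 + \frac{(e^{-\theta u} - 1)(e^{-\theta v} - 1)}{e^{-\theta} - 1}\right)$, combined with exponential marginals; the exponential marginals, the rates and the function names are arbitrary choices made purely for illustration.

```python
import math


def frank_copula(u, v, theta):
    """Bivariate Frank copula CDF; theta != 0 controls the dependence."""
    if theta == 0:
        raise ValueError("theta must be non-zero (theta -> 0 is independence)")
    numerator = math.expm1(-theta * u) * math.expm1(-theta * v)
    return -math.log1p(numerator / math.expm1(-theta)) / theta


def exponential_cdf(x, rate):
    """Marginal CDF of an exponential variable (illustrative marginal)."""
    return -math.expm1(-rate * x) if x > 0 else 0.0


def joint_cdf(x1, x2, theta):
    """F(x1, x2) = C(F1(x1), F2(x2)): copula plus marginals."""
    return frank_copula(exponential_cdf(x1, 1.0), exponential_cdf(x2, 2.0), theta)


# Sending one argument to infinity recovers the other marginal.
print(joint_cdf(0.7, 1e9, theta=4.0), exponential_cdf(0.7, 1.0))
```

Changing `theta` alters the dependence between the two variables while leaving both marginal CDFs untouched, which is the separation that copulas provide.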

Returning to ADMGs, let $\mathcal{G}_i$ be the subgraph corresponding to a barren district $X_{D_i}$. We parameterize a conditional
CDF $F_i(x_{D_i} \mid \mathrm{pa}_{\mathcal{G}}(X_{D_i}))$ of form (6), Markov with respect to $\mathcal{G}_i$, by defining the marginal CDFs and copula dependence
separately. In our implementation the marginal probability for binary or ordinal $X_v$ is an unconstrained conditional
probability mass function. The ordering over the values of $X_v$, $\preceq$, naturally defines the marginal $F_v(x_v \mid \mathrm{pa}_{\mathcal{G}}(X_v))$:

$$F_v(x_v \mid \mathrm{pa}_{\mathcal{G}}(X_v)) = \sum_{x \preceq x_v} \eta_{x}^{\mathrm{pa}_{\mathcal{G}}(X_v)} \tag{9}$$

where $\eta$ are the marginal parameters; conditioned upon the parents of $X_v$, $\eta_{x}^{\mathrm{pa}_{\mathcal{G}}(X_v)}$ is simply the probability that $X_v = x$.
In our implementation for continuous $X_v$, we define the marginal $F_v(x_v \mid \mathrm{pa}_{\mathcal{G}}(X_v))$ using conditional Gaussians:

$$F_v(x_v \mid \mathrm{pa}_{\mathcal{G}}(X_v)) = \Phi\Big(x_v; \sum_{j} \eta_{vj} \phi_j(\mathrm{pa}_{\mathcal{G}}(X_v)), \sigma_v^2\Big), \tag{10}$$

with variance $\sigma_v^2$ and mean given by a linear regressor of fixed basis functions $\phi_j(\cdot)$.
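Both marginal families are straightforward to evaluate. The sketch below is a minimal illustration with hypothetical function names: the ordinal case is a cumulative sum over an ordered conditional probability table (one row per parent configuration), and the continuous case evaluates a Gaussian CDF whose mean is a weighted sum of basis-function values. The specific basis (an intercept and the raw parent value) is an assumption made for the example.

```python
import math


def ordinal_marginal_cdf(pmf_given_parents, x):
    """F_v(x | parents): sum of the conditional PMF over values up to x.
    `pmf_given_parents` lists P(X_v = 0), P(X_v = 1), ... in order."""
    return sum(pmf_given_parents[: x + 1])


def gaussian_marginal_cdf(x, weights, basis_values, variance):
    """F_v(x | parents) = Phi(x; sum_j weights[j] * basis_values[j], variance)."""
    mean = sum(w * b for w, b in zip(weights, basis_values))
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * variance)))


# Ordinal variable with three levels, for one fixed parent configuration.
print(ordinal_marginal_cdf([0.2, 0.5, 0.3], 1))

# Continuous variable with assumed basis (1, parent) and one parent equal to 2.0.
parent = 2.0
print(gaussian_marginal_cdf(1.5, weights=[0.5, 0.5], basis_values=[1.0, parent], variance=4.0))
```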

For a copula with the required bi-directed dependence among $X_{D_i}$, we adopt the approach of product copulas [15].
For each clique $S$ in $\mathcal{G}_i$ let $C_S(u_S)$ be a $|S|$-dimensional copula. Let $d_v$ be the number of cliques variable $X_v$ is in and
define $a_v \equiv u_v^{1/(d_v + 1)}$. The product of copulas given by:

$$C_{D_i}(u_{D_i}) = \prod_{S \in \mathcal{C}_i} C_S(a_S) \prod_{v \in D_i} a_v \tag{11}$$

can be shown to be a copula itself [15]. Plugging in the marginal distributions by defining $u_v = F_v(x_v \mid \mathrm{pa}_{\mathcal{G}}(X_v))$,
the joint CDF over $x_{D_i}$ becomes:

$$F_i(x_{D_i} \mid \mathrm{pa}_{\mathcal{G}}(X_{D_i})) = \left[ \prod_{S \in \mathcal{C}_i} C_S(a_S) \right] \left[ \prod_{v \in D_i} a_v \right], \quad \text{where } a_v \equiv F_v(x_v \mid \mathrm{pa}_{\mathcal{G}}(X_v))^{1/(d_v + 1)} \tag{12}$$

The joint CDF has the form (6) required to be Markov with respect to $\mathcal{G}_i$.
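The role of the exponents $1/(d_v + 1)$ can be seen on the chain $X_1 \leftrightarrow X_2 \leftrightarrow X_3$ with cliques $\{1, 2\}$ and $\{2, 3\}$, so that $d_1 = d_3 = 1$ and $d_2 = 2$. The sketch below uses bivariate Frank copulas for the clique factors (the function names and parameter values are illustrative) and checks that the product has uniform marginals and makes $X_1$ and $X_3$ independent.

```python
import math


def frank(u, v, theta):
    """Bivariate Frank copula CDF (theta != 0)."""
    numerator = math.expm1(-theta * u) * math.expm1(-theta * v)
    return -math.log1p(numerator / math.expm1(-theta)) / theta


def product_copula_chain(u1, u2, u3, theta12, theta23):
    """Product copula for X1 <-> X2 <-> X3 with cliques {1,2} and {2,3}."""
    a1 = u1 ** (1.0 / 2.0)  # X1 is in one clique:  exponent 1 / (1 + 1)
    a2 = u2 ** (1.0 / 3.0)  # X2 is in two cliques: exponent 1 / (2 + 1)
    a3 = u3 ** (1.0 / 2.0)  # X3 is in one clique:  exponent 1 / (1 + 1)
    return frank(a1, a2, theta12) * frank(a2, a3, theta23) * a1 * a2 * a3


# Uniform marginals: setting the other arguments to 1 returns the argument.
print(product_copula_chain(0.3, 1.0, 1.0, 4.0, -2.0))
print(product_copula_chain(1.0, 0.4, 1.0, 4.0, -2.0))
# Marginal independence of X1 and X3: C(u1, 1, u3) = u1 * u3.
print(product_copula_chain(0.3, 1.0, 0.6, 4.0, -2.0))
```

Each variable appears in $d_v$ clique factors plus one standalone factor $a_v$, so its $d_v + 1$ appearances, each raised to the power $1/(d_v + 1)$, multiply back to $u_v$ when the remaining arguments are set to 1.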

We take an easy approach to parameter estimation commonly employed in the copula literature: 

1. fit the (conditional) marginals in (9) or (10) individually (by maximizing likelihood);

2. calculate the corresponding "pseudodata" $a_v$;

3. plug the estimated "pseudodata" into (12), and maximize the likelihood of the product copula (12). Note that
information from the parents has been absorbed into the calculation of $a_v$ via (9) or (10).
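The three steps above can be sketched on a toy problem. The code below is a simplified stand-in rather than the procedure used in the paper: it fits two unconditional Gaussian marginals, uses a single plain bivariate Frank copula in place of the product copula (12), and replaces the likelihood maximization with a grid search over $\theta$. All names, the synthetic data and the grid are assumptions made for illustration.

```python
import math

import numpy as np

_erf = np.vectorize(math.erf)


def normal_cdf(z):
    """Standard normal CDF applied elementwise."""
    return 0.5 * (1.0 + _erf(z / math.sqrt(2.0)))


def frank_log_density(u, v, theta):
    """Log density of the bivariate Frank copula (theta != 0)."""
    e = 1.0 - np.exp(-theta)
    denom = e - (1.0 - np.exp(-theta * u)) * (1.0 - np.exp(-theta * v))
    return np.log(theta * e) - theta * (u + v) - 2.0 * np.log(np.abs(denom))


def fit_two_stage(x, y, theta_grid):
    # Step 1: fit each Gaussian marginal by maximum likelihood.
    mu_x, sd_x = x.mean(), x.std()
    mu_y, sd_y = y.mean(), y.std()
    # Step 2: transform the data into pseudodata on (0, 1).
    u = np.clip(normal_cdf((x - mu_x) / sd_x), 1e-6, 1 - 1e-6)
    v = np.clip(normal_cdf((y - mu_y) / sd_y), 1e-6, 1 - 1e-6)
    # Step 3: maximize the copula likelihood with the marginals held fixed.
    loglik = [frank_log_density(u, v, t).sum() for t in theta_grid]
    return theta_grid[int(np.argmax(loglik))]


rng = np.random.default_rng(0)
z = rng.standard_normal((500, 2))
x = z[:, 0]
y = 0.7 * z[:, 0] + math.sqrt(1.0 - 0.7 ** 2) * z[:, 1]  # positively dependent

theta_grid = np.concatenate([np.linspace(-10, -0.5, 20), np.linspace(0.5, 10, 20)])
theta_hat = fit_two_stage(x, y, theta_grid)
print(theta_hat)
```

Because the marginal parameters are frozen before the copula is fitted, the result is a two-stage estimate rather than a joint maximum likelihood estimate.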

Although the result is not a maximum likelihood estimator, it is a practical procedure that does give consistent
estimators [13]. Given the pseudodata, the third step is maximum likelihood estimation of a CDN model as discussed by
[10]. In our implementation, used in Section 5, we substitute Step 3 by something even simpler to program (footnote 3), while
providing a proof of concept for the feasibility of Bayesian procedures: we put a prior over the copula parameters and
do Metropolis-Hastings (MH) with a Gaussian random walk proposal. To calculate the MH ratio we only need the
likelihood function, which again can be obtained from the message-passing scheme of [9, 10].
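A random-walk Metropolis-Hastings sampler for a single copula parameter needs only a function returning the log posterior up to a constant. The sketch below is generic and hedged: the log-likelihood is a hypothetical stub (a quadratic centred at 2) standing in for the CDN likelihood obtained by message passing, the prior is Gaussian with mean 0 and variance 10 (the variance reading of the prior is an assumption), and the step size and chain length are arbitrary.

```python
import numpy as np


def metropolis_hastings(log_posterior, theta0, n_steps, step_sd, rng):
    """Random-walk MH with a symmetric Gaussian proposal."""
    theta, lp = theta0, log_posterior(theta0)
    samples = np.empty(n_steps)
    for t in range(n_steps):
        proposal = theta + step_sd * rng.standard_normal()
        lp_proposal = log_posterior(proposal)
        # Symmetric proposal: the MH ratio is just the posterior ratio.
        if np.log(rng.uniform()) < lp_proposal - lp:
            theta, lp = proposal, lp_proposal
        samples[t] = theta
    return samples


def log_prior(theta):
    """Gaussian prior with mean 0 and variance 10, up to a constant."""
    return -0.5 * theta ** 2 / 10.0


def log_likelihood_stub(theta):
    """Hypothetical stand-in for the copula likelihood of the pseudodata."""
    return -0.5 * (theta - 2.0) ** 2 / 0.5


rng = np.random.default_rng(1)
samples = metropolis_hastings(
    lambda th: log_prior(th) + log_likelihood_stub(th),
    theta0=0.0, n_steps=20000, step_sd=1.0, rng=rng,
)
print(samples[1000:].mean())  # posterior mean of the stub model is 4 / 2.1
```

Only likelihood evaluations are required, with no gradients, which is what makes this route simpler to program than direct maximization.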

3 Maximum likelihood estimation requires the gradient of the density with respect to the parameters. That is, we need derivatives on top of the
message-passing scheme that transforms a CDF into a density function [10].
Table 1: Average difference in log predictive probability per observation (in millibits) and standard errors. #v is the number of
variables and #b is the number of bi-directed edges in the ADMG.

Data set       #v   #b   Variables marginalized out            E[Δ_DAG] ± s.e.
Insurance      25    2   Driving Skill, Mileage                 72.72 ± 17.15
Alarm          33    4   ErrCauter, TPR, KinkedTube, ArtCO2     76.27 ± 13.78
BreastCancer   10    5   structure inferred                    686.50 ± 76.62
SPECTF         44   25   structure inferred                    -21.14 ± 25.74

Figure 4: DAG (left) and ADMG (right) structures inferred from the Wisconsin breast cancer data set. 



5 Experiments 

We evaluate the usefulness of the MCDN formalism by comparing the $K$-fold cross-validated log-predictive
probabilities of copula MCDNs and DAGs on four data sets. Two data sets are synthetic (from the alarm [1] and insurance
networks [11]) so that the ground truth structure is known and we can compare against an overparameterized DAG. The
non-synthetic data sets are both from the UCI repository (the Wisconsin breast cancer and SPECTF data sets [26, 6]).
All data sets, except for the SPECTF data set which is continuous, consist of ordinal or binary variables.

In our experiments, copula MCDNs are parameterized as described in (9) or (10), and (12). We use Frank copulas,
for computational convenience, with Gaussian $\mathcal{N}(0, 10)$ priors on their parameters $\theta$.

Known structure. Several common cause variables (listed in Table 1) were marginalized out of the data to introduce
bi-directed edges to the true structure. An overparameterized DAG is able, parametrically, to capture a broader set
of conditional dependencies (by having additional edges as well as broader parameterization) than those of a copula
MCDN; however it has many more parameters (exponential in the parents of the district of the corresponding MCDN).
Hence we compare these models on a small sample size of 300.

The difference, in millibits, of the log predictive probability between that of the copula MCDN and of the
overparameterized DAG, per cross-validation test set, is calculated as follows:

$$\Delta_{\mathrm{DAG}} = \frac{1}{n_k} \left[ \log_2 p(x_k \mid \mathcal{D}_k, \eta_k, \mathrm{MCDN}) - \log_2 p(x_k \mid \mathcal{D}_k, \eta_k, \mathrm{DAG}) \right]$$

where $x_k$ and $\mathcal{D}_k$ are the $k$th test and training set, respectively, and $\eta_k$ are the maximum likelihood parameters of the
marginals from $\mathcal{D}_k$.
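For concreteness, the per-fold difference and its summary across folds can be computed as below. This is an arithmetic sketch only: the function names are hypothetical, the inputs are made-up numbers rather than results from the experiments, and the factor of 1000 is simply the conversion from bits to millibits under the assumption that the log predictive probabilities are supplied in bits.

```python
import math


def delta_dag_millibits(logp_mcdn_bits, logp_dag_bits, n_test):
    """Per-observation difference in log2 predictive probability, in millibits."""
    return 1000.0 * (logp_mcdn_bits - logp_dag_bits) / n_test


def mean_and_standard_error(values):
    """Sample mean and standard error of the mean across folds."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(variance / n)


# Made-up fold-level log2 predictive probabilities (MCDN, DAG) for 50 test points.
folds = [(-100.0, -101.0), (-98.5, -100.5), (-102.0, -102.5)]
deltas = [delta_dag_millibits(m, d, n_test=50) for m, d in folds]
print(deltas, mean_and_standard_error(deltas))
```

A positive value means the MCDN assigned higher predictive probability to the held-out fold than the DAG did.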

We calculate the predictive probability of the data set, $p(x_k \mid \mathcal{D}_k, \eta_k, \mathrm{MCDN})$, by averaging $p(x_k \mid \mathcal{D}_k, \eta_k, \theta, \mathrm{MCDN})$
over samples of the copula parameter $\theta$. Positive $\Delta_{\mathrm{DAG}}$ tells us on average how many millibits better the prediction
from the MCDN is over the DAG model. In both cases the log predictive probabilities of the MCDN were significantly,
although only slightly, higher. Comparing to a DAG with marginal parameters marginalized produced the same numbers (up to 5 s.f.)
shown in Table 1.
Unknown structure. Next we ran an experiment on ordinal data without known structure. We used the original
Wisconsin breast cancer data set from the UCI repository [26]. The ADMG and DAG structures shown in Figure 4
were inferred using MBCS* [18] and the $\chi^2$ test. We then repeated the procedure described above, instead calculating
$\Delta_{\mathrm{DAG}}$ relative to the inferred DAG rather than an overparameterized DAG, to obtain the results also shown in Table 1.
On average, the model performed encouragingly.

Finally, we used the SPECTF continuous data set from the UCI repository [6]. We used this data in a more realistic
fashion: instead of learning the structure from the entire data set then performing predictions of subsets, the structure
learning is incorporated into the $K$-fold cross validation. We used $K = 5$ for this experiment and a score-based
structure learning algorithm [22] to find the DAG, followed by fitting the bi-directed edges using the residuals with
the directed structure fixed. Furthermore, if districts were not tree-structured, they were thinned into trees (ordered
by weakest residuals). The residuals were fit by testing marginal independence using [7]. This combined technique
allowed the structure to be inferred efficiently.

We compared this copula MCDN to a Gaussian DAG model (fit using just the DAG learning algorithm of [22] and
maximum likelihood). The results are shown in Table 1. The number of bi-directed edges given is the average over the
$K = 5$ cross validation folds.

In this case, the copula MCDN performed worse than the DAG model. Note that the fitting procedure is suboptimal
for MCDNs and, for computational efficiency, does not alternate between learning directed and bi-directed edges,
and the bi-directed structure is limited to trees. We also tried fitting a copula CDN, that is, omitting the
DAG search step and just fitting the residuals. Compared to this model, the MCDN had an average difference of
11,504 ± 2,456 millibits, suggesting that the DAG marginals are dominating the copula MCDN fit on these data.

6 Conclusion 

Acyclic directed mixed graphs are a natural generalization of DAGs. While ADMGs date back at least to [27], the
potential of this framework has only recently been translated into practical applications due to advances in complete
parameterizations of Gaussian and discrete networks [21, 5, 20]. The framework of cumulative distribution networks
[9, 10] introduced new approaches for more constrained but widely applicable families of marginal independence
(bi-directed) models. By extending CDNs to the full ADMG case, we expect that ADMGs will be readily accessible and
as widespread as DAG models.

There are several directions for future work. While classical approaches for learning Markov equivalence classes
of ADMGs have been developed by means of multiple hypothesis tests of conditional independencies [25], a
model-based approach based on Bayesian or penalized likelihood functions can deliver more robust learning procedures and
a more natural way of combining data with structural prior knowledge. ADMG structures can also play a role in
multivariate supervised learning, that is, structured prediction problems. For instance, [23] introduced some simple
models for relational classification inspired by ADMG models and by the link to seemingly unrelated regression [28].
However, efficient ADMG-structured prediction methods and new advanced structural learning procedures will need
to be developed.
APPENDIX A - PROOFS

Proposition 1. A probability function $P(X_V)$ is Markov with respect to $\mathcal{G}$ if it can be factorized according to (5), given that each
$P_F(X_{D_i} \mid \mathrm{pa}_{\mathcal{G}}(X_{D_i}) \setminus X_{D_i})$ is Markov with respect to $\mathcal{G}_i$.

Before we prove this theorem, we need to state the following result from [19]. Given an ancestral set $A$, the Markov blanket
of vertex $X_v$ in $A$, $\mathrm{mb}(X_v, A)$, is given by the district of $X_v$ in $(\mathcal{G})_A$ (except $X_v$ itself) along with all parents of elements of this
district. Let a total ordering $\prec$ of the vertices of $\mathcal{G}$ be any ordering such that if $X_v \prec X_t$, then $X_t$ is not an ancestor of $X_v$ in $\mathcal{G}$.
A probability measure is said to satisfy the ordered local Markov condition for $\mathcal{G}$ with respect to $\prec$ if, for any $X_v$ and ancestral set
$A$ such that $X_t \in A \setminus \{X_v\} \Rightarrow X_t \prec X_v$, we have that $X_v$ is independent of $A \setminus (\mathrm{mb}(X_v, A) \cup \{X_v\})$ given $\mathrm{mb}(X_v, A)$. The main
result from [19] states:

Theorem 1. The ordered local Markov condition is equivalent to the global Markov condition in ADMGs (footnote 4).

Proof of Proposition 1: The proof is done by induction on $|X_V|$, with the case $|X_V| = 1$ being trivial. We will show that if $P(X_V)$
is a probability function that factorizes according to (5), as given by an ADMG $\mathcal{G}$, then $P(X_V)$ is Markov with respect to $\mathcal{G}$. To
prove this, first notice there must be some $X_v$ with no children in $\mathcal{G}$, since the graph is acyclic. Let $X_{D_i}$ be the district of $X_v$. By
assumption,

$$P(X_V) = P_F(X_v \mid (X_{D_i} \cup \mathrm{pa}_{\mathcal{G}}(X_{D_i})) \setminus X_v) \times P_F(X_{D_i} \setminus X_v \mid \mathrm{pa}_{\mathcal{G}}(X_{D_i}) \setminus X_{D_i}) \times \prod_{j \neq i} P_F(X_{D_j} \mid \mathrm{pa}_{\mathcal{G}}(X_{D_j}) \setminus X_{D_j}) \tag{13}$$

Since $X_v$ is childless, it does not appear in any of the factors in the expression above, except for the first. Hence,

$$P(X_V \setminus X_v) = P_F(X_{D_i} \setminus X_v \mid \mathrm{pa}_{\mathcal{G}}(X_{D_i}) \setminus X_{D_i}) \times \prod_{j \neq i} P_F(X_{D_j} \mid \mathrm{pa}_{\mathcal{G}}(X_{D_j}) \setminus X_{D_j}) \tag{14}$$

which by induction hypothesis is Markov with respect to the marginal graph $(\mathcal{G})_{X_V \setminus X_v}$ (one minor detail is that $(\mathcal{G})_{X_V \setminus X_v}$ might
have more districts than $\mathcal{G}$ after removing $X_v$. However, the result still holds by further factorizing $P_F(X_{D_i} \setminus X_v \mid \mathrm{pa}_{\mathcal{G}}(X_{D_i}) \setminus X_{D_i})$
according to the newly formed districts of $X_{D_i} \setminus X_v$, which is possible by the construction of $P_F(\cdot)$ and $\mathcal{G}_i$). By the ordered local
Markov property for ADMGs and any ordering $\prec$ where $X_v$ is the last vertex, probability function $P(X_V)$ will be Markov with
respect to $\mathcal{G}$ if, according to $P(X_V)$, the Markov blanket of $X_v$ in $\mathcal{G}$ makes $X_v$ independent of the remaining vertices. But this is true
by construction, since this Markov blanket is contained in $X_{D_i} \cup \mathrm{pa}_{\mathcal{G}}(X_{D_i})$ according to Theorem 1. □

4 Notice this reduces to the standard notion of local independence in DAGs, where a vertex is independent of its (non-parental) non-descendants
given its parents, from which d-separation statements can be derived [14, 17].

Notice that factorization ((5} is seemingly cyclical: for instance, Figure|2ja) implies the factorization Pf(Xi, X2 \ X4)Pf(Xs, X4 | X\). 
This suggests that there are additional constraints tying parameters across different factors. However, there are no such constraints, 
as guaranteed through the following result: 

Proposition 2. Given an ADMG Q with respective subgraphs {Qi} and districts {Xr>i}, any collection of probability functions 
PF^Xoi j pag^Xn^XXDi), Markov with respect to the respective Qi, implies that (Tjf is a valid probability function (a non- 
negative function that integrates to 1 ). 

Proof. It is clear that $5$ is non-negative. We have to show it integrates to 1. As in the proof of Proposition 1, first notice there must 
be some X v with no children in Q, since the graph is acyclic. Those childless vertices can be marginalized as in Equation J 1 4b if 
they do not appear on the right-hand side of any factor Pf(- | •), and removed from the graph along with all edges adjacent to them. 
After some marginalizations, suppose that in the current marginalized graph, a childless vertex X$ appears on the right-hand side 
of some factor Pf{Xfh | pag(X_D i )\Jfij i ). Because X® has no children in XD t , by construction Xd, are X v are independent 
given the remaining elements in pag (Xd ; )\Xd ; . As such, X$ can be removed from the right-hand side of all remaining factors, 
and then marginalized. The process is repeated until the last remaining vertex is marginalized, giving 1 as the result. □. 

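Proposition 3. $F_i(x_{D_i} \mid \mathrm{pa}_{\mathcal{G}}(X_{D_i}))$ is a CDF for any choice of $\{\{F_S(x_S \mid \mathrm{pa}_{\mathcal{G}}(X_S))\}, \{F_v(x_v \mid \mathrm{pa}_{\mathcal{G}}(X_v))\}\}$. If, according
to each $F_S(x_S \mid \cdot)$, $X_S$ is independent of any element in $\mathrm{pa}_{\mathcal{G}}(X_{D_i}) \setminus \mathrm{pa}_{\mathcal{G}}(X_S)$, the corresponding conditional probability
function $F_i(x_{D_i} \mid \mathrm{pa}_{\mathcal{G}}(X_{D_i}))$ is Markov with respect to $\mathcal{G}_i$.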

Proof. Each factor in (6) is a CDF with respect to $X_{D_i}$, with $\mathrm{pa}_{\mathcal{G}}(X_{D_i})$ fixed, and hence their product is also a CDF [9]. To show the
Markov property, it is enough to consider the modified graph $\mathcal{G}_i'$ constructed by transforming all directed edges in $\mathcal{G}_i$ into bi-directed
edges, since the implied distributions conditional on $\mathrm{pa}_{\mathcal{G}}(X_{D_i})$ for $\mathcal{G}_i'$ and $\mathcal{G}_i$ are Markov equivalent [21]. It follows directly from
the assumptions and the properties of CDFs that disconnected sets in $\mathcal{G}_i'$ are marginally independent, which corresponds to the
Markov properties of the bi-directed graph $\mathcal{G}_i'$ [19]. □
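
The first step of the proof, that a product of CDFs is again a CDF, has a standard probabilistic reading that may help intuition. As a sketch, with auxiliary random vectors $U$ and $W$ introduced here only for illustration (they do not appear in the paper): if $U \sim F_1$ and $W \sim F_2$ are independent $d$-dimensional random vectors, then their coordinatewise maximum has CDF $F_1 F_2$.

```latex
P\big(\max(U_1, W_1) \le x_1, \dots, \max(U_d, W_d) \le x_d\big)
  = P(U \le x,\; W \le x)
  = P(U \le x)\, P(W \le x)
  = F_1(x)\, F_2(x).
```

Here $U \le x$ is understood coordinatewise, and the second equality uses the assumed independence of $U$ and $W$.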



APPENDIX B - BINARY CASE: RELATION TO COMPLETE PARAMETERIZATION 

A complete parameterization for binary ADMG models is described by [20]. As we will see, parameters are defined in the context
of different marginals, analogous to the purely bi-directed case (5). 

As in the bi-directed case, the joint probability distribution is given by an inclusion-exclusion scheme: 

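$$P(X_V = \alpha(V)) = \sum_{C:\, \alpha^{-1}(0) \subseteq C \subseteq V} (-1)^{|C \setminus \alpha^{-1}(0)|} \prod_{H \in [C]_{\mathcal{G}}} P(X_H = 0 \mid X_{\mathrm{tail}(H)} = \alpha(\mathrm{tail}(H))) \qquad (15)$$

where $\alpha(V)$ is a binary vector in $\{0, 1\}^{|X_V|}$ and $\alpha^{-1}(0)$ is a function that indicates which elements in $X_V$ were assigned to be
zero.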

Each $C$ indicates which elements are set to zero in the respective term of the summation. Depending on $C$, the factorization
changes. $[C]_{\mathcal{G}}$ is a set of subsets of $X_V$, one subset per district, each subset being barren in $\mathcal{G}$. The corresponding $\mathrm{tail}(H)$ is the
Markov blanket for the ancestral set that contains $H$ as its set of childless vertices.

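As in our discussion of standard CDNs, Equation (15) can be interpreted as the CDF-to-probability transformation (3). It can
be rewritten as

$$P(X_V = \alpha(V)) = \sum_{C:\, \alpha^{-1}(0) \subseteq C \subseteq V} (-1)^{|C \setminus \alpha^{-1}(0)|} \times \prod_{H \in [C]_{\mathcal{G}}} P(X_{D_i \setminus \mathrm{tail}(H)} \le \alpha(V) \mid X_{\mathrm{tail}(H)} = \alpha(\mathrm{tail}(H)))$$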

Hence, this parameterization can also be interpreted as a CDF parameterization. One important difference is that each term in the
summation uses only a subset of each district, $X_{D_i \setminus \mathrm{tail}(H)}$ instead of $X_{D_i}$. Notice that some elements of $X_{D_i}$ appear in the
conditioning set (i.e., $\mathrm{tail}(H)$ contains some of the remaining elements of $X_{D_i}$, on top of the respective parents).
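
The inclusion-exclusion reading of the CDF-to-probability transformation can be made concrete for binary variables. The sketch below is a generic illustration, not the ADMG parameterization itself: it recovers $P(X = a)$ from a joint CDF by summing over all sets $C$ that contain the zero set of $a$, with sign $(-1)^{|C \setminus a^{-1}(0)|}$. The function names and the example distribution are assumptions made for the example.

```python
import itertools

def cdf_from_pmf(pmf, n):
    """Return F(x) = P(X_1 <= x_1, ..., X_n <= x_n) for binary variables."""
    def cdf(x):
        return sum(p for a, p in pmf.items()
                   if all(a[i] <= x[i] for i in range(n)))
    return cdf

def pmf_from_cdf(cdf, a):
    """Recover P(X = a) by inclusion-exclusion over supersets C of the zero set of a."""
    n = len(a)
    zeros = [i for i in range(n) if a[i] == 0]
    ones = [i for i in range(n) if a[i] == 1]
    total = 0.0
    for k in range(len(ones) + 1):
        for extra in itertools.combinations(ones, k):
            c = set(zeros) | set(extra)            # C with a^{-1}(0) subset of C
            point = tuple(0 if i in c else 1 for i in range(n))
            total += (-1) ** k * cdf(point)        # k = |C \ a^{-1}(0)|
    return total

# Hypothetical distribution over three binary variables (weights 1..8, normalized).
configs = list(itertools.product((0, 1), repeat=3))
pmf = {a: (w + 1) / 36.0 for w, a in enumerate(configs)}
cdf = cdf_from_pmf(pmf, 3)

for a in configs:
    print(a, pmf[a], pmf_from_cdf(cdf, a))  # the two values agree up to rounding
```

For binary variables, evaluating the CDF at a point with zeros on $C$ and ones elsewhere is the same as computing $P(X_C = 0)$, which is why each term of the sum can be read either as a CDF value or as a marginal probability.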

The need for using subsets comes from the necessity of enforcing independence constraints entailed by bi-directed paths. As
in the CDN model, the MCDN criterion factorizes each CDF according to its cliques as an indirect way of accounting for such
constraints. Hence, we do not construct factorizations for different marginals: each factor within a summation term in (15) includes
all elements of each district. We enforce that they remain barren by the transformation in Section 3.3, which is unnecessary in
[20] because only barren subsets are being considered.

To understand how the parameterizations coincide, or which constraints analogous to (4) emerge in our parameterization,
consider first the following example. Using the results from [20], the graph in Figure 2(a) needs the specification of the following
marginals:

$$\begin{aligned}
P(X_1, X_4) &= P(X_1)\,P(X_4) \\
P(X_1, X_3, X_4) &= P(X_3, X_4 \mid X_1)\,P(X_1) \\
P(X_1, X_2, X_4) &= P(X_1, X_2 \mid X_4)\,P(X_4) \\
P(X_1, X_2, X_3, X_4) &= P(X_1, X_2 \mid X_4)\,P(X_3, X_4 \mid X_1) \\
P(X_1, X_3) &= P(X_3 \mid X_1)\,P(X_1) \\
P(X_2, X_4) &= P(X_2 \mid X_4)\,P(X_4)
\end{aligned} \qquad (16)$$

As an example, the probability $P(X_{14} = 0, X_{23} = 1) = P(X_1 = 0, X_2 = 1, X_3 = 1, X_4 = 0)$ can be derived from the above
factorizations and (15) as

$$\begin{aligned}
P(X_1 = 0, &\ X_2 = 1, X_3 = 1, X_4 = 0) \\
={}& P(X_1 \le 0, X_2 \le 1, X_3 \le 1, X_4 \le 0) - P(X_1 \le 0, X_2 \le 1, X_3 \le 0, X_4 \le 0) \\
 & - P(X_1 \le 0, X_2 \le 0, X_3 \le 1, X_4 \le 0) + P(X_1 \le 0, X_2 \le 0, X_3 \le 0, X_4 \le 0) \\
={}& P(X_1 = 0, X_4 = 0) - P(X_1 = 0, X_3 = 0, X_4 = 0) \\
 & - P(X_1 = 0, X_2 = 0, X_4 = 0) + P(X_1 = 0, X_2 = 0, X_3 = 0, X_4 = 0) \\
={}& P(X_1 = 0)\,P(X_4 = 0) - P(X_{34} = 0 \mid X_1 = 0)\,P(X_1 = 0) \\
 & - P(X_{12} = 0 \mid X_4 = 0)\,P(X_4 = 0) + P(X_{12} = 0 \mid X_4 = 0)\,P(X_{34} = 0 \mid X_1 = 0)
\end{aligned}$$

where the last line comes from the pool of possible factorizations (16). The corresponding probability using the MCDN
parameterization is

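$$\begin{aligned}
P(X_1 = 0, &\ X_2 = 1, X_3 = 1, X_4 = 0) = P(X_1 = 0, X_2 = 1 \mid X_4 = 0)\,P(X_3 = 1, X_4 = 0 \mid X_1 = 0) \\
={}& \big(P(X_1 \le 0, X_2 \le 1 \mid X_4 = 0) - P(X_1 \le 0, X_2 \le 0 \mid X_4 = 0)\big) \times \\
 & \big(P(X_3 \le 1, X_4 \le 0 \mid X_1 = 0) - P(X_3 \le 0, X_4 \le 0 \mid X_1 = 0)\big) \\
={}& \big(P(X_1 = 0 \mid X_4 = 0) - P(X_1 = 0, X_2 = 0 \mid X_4 = 0)\big) \times \\
 & \big(P(X_4 = 0 \mid X_1 = 0) - P(X_3 = 0, X_4 = 0 \mid X_1 = 0)\big) \\
={}& \big(P(X_1 = 0) - P(X_1 = 0, X_2 = 0 \mid X_4 = 0)\big) \times \\
 & \big(P(X_4 = 0) - P(X_3 = 0, X_4 = 0 \mid X_1 = 0)\big) \\
={}& P(X_1 = 0)\,P(X_4 = 0) - P(X_{34} = 0 \mid X_1 = 0)\,P(X_1 = 0) \\
 & - P(X_{12} = 0 \mid X_4 = 0)\,P(X_4 = 0) + P(X_{12} = 0 \mid X_4 = 0)\,P(X_{34} = 0 \mid X_1 = 0)
\end{aligned}$$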

where the first line comes from the factorization of $P(X_1 = 0, X_2 = 1, X_3 = 1, X_4 = 0)$ according to (5) and the fourth line
comes from the Markov properties of each $\mathcal{G}_i$ factor. Although these parameterizations have the same high-level parameters, they
still do not coincide, as shown in the next example.
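
The agreement between the two derivations for Figure 2(a) can be checked numerically. The sketch below is a minimal illustration under stated assumptions: the two district factors are binary, $X_1$ is independent of $X_4$ under the first factor, $X_4$ is independent of $X_1$ under the second, and all parameter values and helper names are hypothetical. It compares the direct product of factors, the inclusion-exclusion sum over joint marginals, and the final four-term expression in conditional parameters.

```python
import itertools

def bern(p_one, value):
    """Probability that a binary variable with P(1) = p_one takes `value`."""
    return p_one if value == 1 else 1.0 - p_one

# Hypothetical parameters, not taken from the paper.
P_X1, P_X4 = 0.3, 0.6                                         # P(X1=1), P(X4=1)
P_X2 = {(0, 0): 0.1, (0, 1): 0.6, (1, 0): 0.4, (1, 1): 0.9}  # P(X2=1 | x1, x4)
P_X3 = {(0, 0): 0.7, (0, 1): 0.2, (1, 0): 0.5, (1, 1): 0.3}  # P(X3=1 | x4, x1)

def p1(x1, x2, x4):
    """District factor P(x1, x2 | x4), with X1 independent of X4."""
    return bern(P_X1, x1) * bern(P_X2[(x1, x4)], x2)

def p2(x3, x4, x1):
    """District factor P(x3, x4 | x1), with X4 independent of X1."""
    return bern(P_X4, x4) * bern(P_X3[(x4, x1)], x3)

def joint(x1, x2, x3, x4):
    return p1(x1, x2, x4) * p2(x3, x4, x1)

NAMES = ("x1", "x2", "x3", "x4")

def marginal(**fixed):
    """Probability of the event that fixes the named variables."""
    return sum(joint(*x) for x in itertools.product((0, 1), repeat=4)
               if all(x[NAMES.index(k)] == v for k, v in fixed.items()))

# Route 1: product of district factors.
direct = joint(0, 1, 1, 0)

# Route 2: inclusion-exclusion over joint marginals.
incl_excl = (marginal(x1=0, x4=0) - marginal(x1=0, x3=0, x4=0)
             - marginal(x1=0, x2=0, x4=0)
             + marginal(x1=0, x2=0, x3=0, x4=0))

# Route 3: the four-term expression in conditional parameters.
q12 = p1(0, 0, 0)                # P(X12 = 0 | X4 = 0)
q34 = p2(0, 0, 0)                # P(X34 = 0 | X1 = 0)
px1, px4 = 1 - P_X1, 1 - P_X4    # P(X1 = 0), P(X4 = 0)
four_term = px1 * px4 - q34 * px1 - q12 * px4 + q12 * q34

print(direct, incl_excl, four_term)  # three matching values, up to rounding
```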

For a more complicated case where an extra constraint appears in our parameterization, consider Figure 3(a). In [20], it is shown
that one of the parameters of the complete parameterization is $P(X_1 = 0, X_3 = 0 \mid X_2 = 0, X_4 = 0, X_5 = 0)$, which reflects
the fact that $X_1$ and $X_5$ are dependent given all other variables. This is also true in our case, except that according to Figure 3(c), our
corresponding CDF is given by

$$F(x_1 \mid X_2)\,F(x_1, x_3)\,F(x_2, x_3)\,F(x_3, x_4)\,F(x_4, x_5)\,F(x_3 \mid X_5)\,F(x_2 \mid X_4)$$

which, evaluated at $X_{12345} = 0$, gives

$$\begin{aligned}
& P(X_1 = 0 \mid X_2 = 0)\,P(X_1 = 0, X_3 = 0)\,P(X_2 = 0, X_3 = 0)\,P(X_3 = 0, X_4 = 0) \times \\
& P(X_4 = 0, X_5 = 0)\,P(X_3 = 0 \mid X_5 = 0)\,P(X_2 = 0 \mid X_4 = 0)
\end{aligned}$$

implying that $P(X_{12345} = 0)$ factorizes as $f(X_1, X_2, X_3, X_4)\,g(X_2, X_3, X_4, X_5)$, the generalization of (4).
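
To make the factorization explicit, the seven terms of the product above can be grouped by whether they involve $X_5$. The grouping below is one possible choice of $f$ and $g$, offered as an illustration rather than as the authors' intended decomposition; other assignments of the terms that do not involve $X_1$ or $X_5$ are equally valid.

```latex
\begin{aligned}
f(X_1, X_2, X_3, X_4) &= P(X_1 = 0 \mid X_2 = 0)\, P(X_1 = 0, X_3 = 0)\, P(X_2 = 0, X_3 = 0)\,
                         P(X_3 = 0, X_4 = 0)\, P(X_2 = 0 \mid X_4 = 0), \\
g(X_2, X_3, X_4, X_5) &= P(X_4 = 0, X_5 = 0)\, P(X_3 = 0 \mid X_5 = 0).
\end{aligned}
```

No term involves both $X_1$ and $X_5$, which is what allows the product to split in this way.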